Masthead

Creating Synthetic Data in R

To evaluate new methods and to diagnose problems with modeling processes, we often need to generate synthetic data. This allows us to precisely control the data going into our modeling methods and then check the output to see if it is as expected. In this lab, you'll use R to create point and raster data sets for use in trend surface and interpolation analysis.

1. Creating Random Grid Data

First, let's create a single array with some random data in R:

# Create a one dimensional array for our independent variable

NumEntries=100
X=1:NumEntries # create an array with 1,2,3,... through 100
plot(X)

# Create an array with random values as our dependent data

RandomMean=0 # this will be the mean of our random distribution
RandomStdDev=1 # this will be the standard deviation of the distribution

RandomValues=rnorm(NumEntries, mean = RandomMean, sd = RandomStdDev) 
Y=RandomValues # one number between -1 and 1

# Plot the Y values
plot(Y)

When you run the code above, you should see a line for the X values and a plot of random values between about -2 and 2 for Y. The last plot should show the same thing as the second plot. Why is this?

The random function does not create truly random numbers because computers are deterministic machines. However, for our purposes, these numbers will be just fine.

The code above uses the "rnom()" function which creates random values from a normal distribution. This is the most commonly used but there are other function in R to create random values from other distributions.

Now try different values for the mean and standard deviation.

Question 1: What effect does the mean and standard deviation have on the data?

Try the following:

# Plot X against Y 
plot(X,Y) 

The plot does not appear to change. Why is this?

The reason is that we are plotting X against Y but there is no relationship between X and Y. In other words, Y is not DEPENDENT on X.

In the rest of this last we'll refer to this data as a "error component" of the model and the parameters RandomMean and RandomStdDev as the "coefficients of the error component".

2. Create a Trend

A trend is another term for correlation where there is some trend in the data based on some phenomenon that we can measure. There is a large area of modeling that uses polynomial expressions to model phenomenon. As a review of polynomials, remember that the equation for a line is:

y=B0 + B1 * x

Where B0 (beta 0) is the slope of the line and B1 is the intercept. When we are doing regression, the "B0" represents the value of y when the independent variable is 0. The "B1" is than the relationship between x and y. Another way to say this is if "B1" is small, then y changes little as x changes, if "B1" is large, then y changes a lot as x changes. Since the exponent on "x" is one, this is referred to as a "first order" polynomial.

Note: When we fit a model to data, B1 and B0 are the "estimated parameters", also called "coefficients".

In statistics, we use B0 and B1 instead of m and b (or a and b) we can add B2, B3, etc. for additional independent variables.

Add the code below to create a trend and plot it.

# Setup the parameters for the trend

B0=1 # intercept
B1=1 # slope

Y=B0+B1*X

plot(Y)
abline(B0,B1)

Question 2: What effect does setting B1 to 10 have? What effect does setting B1 to -1 have?

Question 3: What effect does changing B0 have?

Try other values until you are comfortable creating linear data in R. In the code below, we'll refer to B0 and B1 as the "coefficients of the original trend". Together with the "coefficients of the error component" is becomes the "coefficients of the original data".

3. Combining a trend with random data

Add the code below to add a trend to the data and plot the result.

# Combine the trend and range data

Y=B0+B1*X + RandomValues

Question 4: What effect does increasing and decreasing the value of the standard deviation in the random function have?

Try different values for each of the coefficients of the original data until you are comfortable with the impact that random effects and linear trends have on data.

4. Evaluating a Trend Surface

Remember the "lm()" function from last weeks lab? The format for this function is:

# Create the model
RegressionModel=lm(Y~X)

# output the coefficients of the model
print(RegressionModel)

# output additional information on the model
summary(RegressionModel)

Where Y is the dependent variable and X is the independent variable. You can also add additional independent variables. See my "R" web site for how to interpret the outputs from "print(...)" and "summary(...)".

Try different models, plot and print them to see if R can recreate your original models. Now increase the number of values in your data set. Also, increase and reduce the magnitude of the standard deviation of the random component and examine whether the models improve with the addition of random data.

Note: Running lm() is the equivalent of running the "Trend" tool in ArcGIS. You'll find that the tools in Archaist tend to be easier to use while the tools in R have more flexibility.

Question 5: How well does R find the original coefficients of your original data?

Removing the trend

Now we can remove the trend from our data by simply subtracting a prediction from our "original data". To create a prediction from our model, we do need to convert our array into a data frame. Then, we can subtract our predictions from our model to find the residuals and histogram them.

# create a new array of x values for the covariates and convert it to a data frame
X1=1:100
X1 = (X1)

# create a prediction for Y based on our new X values and the fitted model
Y1=predict.lm(RegressionModel,newdata=X1) 
plot(Y1)

# subtract the original data points from the predicted values to find the residuals
Residuals=Y1-Y

# create a histogram and a qqplot of the residuals
hist(Residuals)
qqnorm(Residuals)

Create histograms for the original response values (Y), your predicted trend surface, and your residuals.

Question 6: How good a job did the prediction do at removing the trend in your original data? How does changing the coefficients of the original data impact the lm() function's ability to remove the trend?

5. Getting Control of Polynomials (Optional)

Add additional independent variables to the model to add higher order functions. Adding a square term makes the function "quadratic", cubing X makes it a cubic and so on. This is referred to as raising the "Degree of the Polynomial".

# Combine the trend and range data

Y=B0+B1*X+B2*X^2+B3*X^3 + RandomValues

Question 7: What effect does increasing and decreasing the values of B3 and B4? Remember to try negative numbers.

You may find that it is challenging to get anything other than a straight line or a single exponential curve. To see something more interesting, you'll need to think about what is happening with each piece of the equation. As you add the higher order coefficients, remember that they will have larger values so you'll need to increase the lower order coefficients for them to have an effect. Try making the lower order ones 10 times as large as the next-highest order coefficient.

The most important learning here is how challenging it is to have polynomials represent complex phenomena. Polynomials have their place but they are challenging to work with and typically do not respond in the way that natural spatial phenomena do. Over the next weeks, we'll be learning other techniques that use different mathematics to create spatial models.

Note that you can also add additional independent variables to a polynomial very easily. The general form for a multivariate linear (first order) equation is then:

y=B0+B1*x1+B2*x2+B3*x3...

Where B0 is the intercept and B1, B2, and B3 are the slope values ("m" from above) that determine how y responds to each x value.

The "lm()" function we have been using is named for "linear model" but it can actually create models for multidimensional, higher-order, polynomials. Update your model for the additional coefficients and see how well lm() performs.

6. Creating data with auto correlation (optional)

Another phenomenon in the real world is that things that are closer together tend to be more alike. This can be because of a trend that is from another phenomenon or because trees and other species tend to spread seeds near themselves more than far away. After we remove any trends, we want to understand if there is any auto correlation in the data.

Auto correlation is often a trend that has yet to be discovered

The gradient dataset from above is highly auto-correlated but this is also an easy trend to detect. Trigonometric functions (Sine and Cosine) can be used to create patterns of values that change spatially over a grid. Below is a method for adding some fake auto-correlated data.

# autocorrelated component

Frequency=0.1
Magnitude=20

Y=B0+B1*X + RandomValues + sin(X*Frequency)*Magnitude

plot(Y)

Change the frequency and magnitude of the auto correlation to see it's effect on the data.

Below is code for R that will compute a Moran's I statistic for a linear array.

# simplified Moran's i (one dimension)

TheMean=mean(Y) # find the overall mean of the array

# find the sum of squares 

SumOfSquares=0
for (i in 1:NumEntries ) 
{
  SumOfSquares=SumOfSquares+(Y[i]-TheMean)^2
}
print(SumOfSquares)

# find the sum of the weighted differences
TheWeights=0
TheSum=0
for (i in 1:NumEntries ) 
{
  for (j in 1:NumEntries ) 
  {
    if (i!=j) # don't compute differences between the same values
    {
      TheWeight=1/((i-j)^2) # compute a weight based on position in the array (i-j)
      TheWeights=TheWeights+TheWeight # add the weights to an overall sum of weights
      
	  # the key equation where were add the weighted differences from the mean and sum it
      TheSum=TheSum+TheWeight*(Y[i]-TheMean)*(Y[j]-TheMean)
    }
  }
}
# scale the sum to be from 0 to 1
MoransI=(NumEntries/TheWeights)*(TheSum/SumOfSquares)
print(MoransI) 

Question 8: What is the value of Moran's I?

To remove the auto correlation, we would need to use a semi-variogram to determine the amount of auto-correlation and then created a Kriged surface which we would subtract from our data. We do not have a tool to perform this on 1 dimensional data so we'll wait to tackle that.

8. Creating Two Dimensional Data

When we have two independent variables (aka multiple linear regression) we create a DataFrame in R which is just a table that is very similar to an attribute table in ArcGIS. There are three columns in the table, one for each independent variable and one for the dependent variable. The code below creates such a table where the dependent variable is a linear trend of two independent variables. Note that we have included the rgl library to create 3 dimensional plots. This is by far the best documentation I have found for 3D plotting with R.

library(rgl) # more flexible library than the built in 3D plotting library


# parameters

X1Dimension=100 # width of the surface in the x1 direction
X2Dimension=100 # width of the surface in the x2 direction

B0=4 # intercept at X1=0, X2=0
B1=10 # X1 factor
B2=1 # X2 factor


# setup the data frame with all 0 values

NumEntries=X1Dimension*X2Dimension

X1=rep(0,NumEntries)
X2=rep(0,NumEntries)
YValues=rep(0,NumEntries)

DataFrame=data.frame(YValues,X1,X2)


# set the array to have a trend surface

Index=0
for (x1 in 1:X1Dimension ) {
  for (x2 in 1:X2Dimension) {
    DataFrame$X1[Index]=x1
    DataFrame$X2[Index]=x2
    DataFrame$YValues[Index]=B0+B1*x1+B2*x2 # get y values from the equation
    Index=Index+1
  }
}
# plot the data

plot3d(DataFrame$X1, DataFrame$X2, DataFrame$YValues,size=1,xlab="x1",ylab="x2",zlab="y")

The code below will add some randomness into our trend data just as we did before and then plot the results.

# add a random component

RandomMean=0
RandomStdDev=80

for (Index in 1:NumEntries ) {
    RandomValue=rnorm(1, mean = RandomMean, sd = RandomStdDev); # generate the random value
    
    DataFrame$YValues[Index]=DataFrame$YValues[Index]+RandomValue # add the error into the data
}
# plot the result in 3D

plot3d(DataFrame$X1, DataFrame$X2,DataFrame$YValues, size=1,xlab="x1",ylab="x2",zlab="y")

Then, we can create a mulitple linear regression model in the same way we did before except by adding an additional indecent variable as below.

# Create the model
RegressionModel=lm( DataFrame$YValues~DataFrame$X1+DataFrame$X2)


# output the coefficients of the model
print(RegressionModel)


# output additional information on the model
summary(RegressionModel)

Plotting the model is a bit trickier. First, we have to get the model parameters out of the model. Then, we create a 2 dimensional matrix to represent our modeled trend and we fill it with values from our equation but using the modeled coefficients. Then we create two arrays that represent the range of the x1 and x2 variables for the axis of our chart. We can then plot our points with the rgl.points() function and add the trend surface with the rgl.surface() function.

###############################################
# Plot the model against the points

# Extract the coefficients from the model 
coefficients=RegressionModel$coefficients
EstimatedB0=coefficients[["(Intercept)"]]
EstimatedB1=coefficients[["DataFrame$X1"]]
EstimatedB2=coefficients[["DataFrame$X2"]]

# create a matrix (i.e. surface) with values based on the coefficients
EstimatedMatrix=matrix(1:NumEntries, nrow = X1Dimension, ncol = X2Dimension)

Index=0
for (x1 in 1:X1Dimension ) {
  for (x2 in 1:X2Dimension) {
    EstimatedMatrix[x1,x2]=EstimatedB0+EstimatedB1*x1+EstimatedB2*x2
    Index=Index+1
  }
}

# create arrays with the values for the x1 and x2 axis
PredX1=1:X1Dimension
PredX2=1:X2Dimension

# create the 3d plot
rgl.open() # create a new 3d plot
rgl.bg(color = "white") # Setup the background color
rgl.bbox(color=c("#333377","black"), emission="#333377",specular="#3333FF", shininess=5, alpha=0.8) # Add bounding box decoration
aspect3d(10,1,10) # make the x1, x2 axis larger
rgl.points(DataFrame$X1,DataFrame$YValues, DataFrame$X2, color = "red",size=1)

rgl.surface(PredX1,PredX2,EstimatedMatrix,color = "black", alpha = 0.8, lit = FALSE) # add the surface

Below is the code to add modeled values to the data frame and then find residuals (note: there may be an easier way to do this).

##########################################
DataFrame$YModelValues=rep(0,NumEntries)

Index=0
for (x1 in 1:X1Dimension ) {
    for (x2 in 1:X2Dimension) {
     	DataFrame$YModelValues[Index]=EstimatedB0+EstimatedB1*x1+EstimatedB2*x2
     	Index=Index+1
		}
}
Residuals= DataFrame$YModelValues- DataFrame$YValues
hist(Residuals,breaks=40)

© Copyright 2018 HSU - All rights reserved.